Figure caption: Main causes of deforestation in central Menabe. a-a’: Slash-and-burn agriculture (“hatsake”) for peanut crop. Peanut (a’) is cultivated as a cash crop. Part of the production is at the destination of the national market but most of it is exported outside Madagascar, mainly for the Chinese market. b-b’: Slash-and-burn agriculture for maize crop. maize (b’) is cultivated for auto-consumption and as a cash crop. The production of maize is at the destination of the national market and is used in particular to brew the national beers. c-c’: Cyclone followed by uncontrolled fires. Cyclone “Fanele” (2009) caused tree mortality and accumulation of wood fuel on the ground. As a consequence, uncontrolled fires set on nearby pastures (c’) spread over large areas of forest after 2009. d-d’: Illegal logging. Timbers are used for house and boat construction.
This tutorial is licensed under the Creative Commons Attribution-ShareAlike 4.0 International License.
In RStudio, open a terminal and clone the GitHub repository locally:
git clone https://github.com/ghislainv/forecasting-deforestation-Mada
Navigate with RStudio in the folder named forecasting-deforestation-Mada and open the R Notebook file named training.Rmd. R Markdown files (*.Rmd) can be used in R to embbed text, code, table and figues inside a unique document. Code are organized into ‘chunks’ that can be executed independently and interactively. An R Notebook is an R Markdown document with output visible immediately beneath the input.
I used this notebook to write the tutorial available at https://forestatrisk.cirad.fr/tutorial.
In the preambule, change the author and date with your name and today’s date.
# Libraries
pkg <- c("reticulate","curl","dplyr","readr","broom",
"ggplot2","raster","rasterVis","rgdal","leaflet","kableExtra","sf")
load.pkg <- function(x) {
if(!require(x, character.only=TRUE)) {
install.packages(x)
require(x, character.only=TRUE)
}
}
loaded <- lapply(pkg,load.pkg)
## Remove useless objects
rm(pkg,load.pkg,loaded)To install Miniconda3 with Python3, follow the instructions here: https://conda.io/docs/user-guide/install/index.html
Using functions from the reticulate R package, install the conda environment with a selection of Python packages.
library(reticulate)
if (!("forestatrisk" %in% conda_list()$name)) {
# Create conda virtual environment
conda_create("forestatrisk")
# Install python ackages in virtual environment
conda_pkg <- c("gdal","numpy","pandas","patsy","pip",
"matplotlib", "scipy", "scikit-learn",
"statsmodels")
conda_install("forestatrisk", conda_pkg, forge=FALSE)
# Install forestatrisk from git with pip
git_repo <- "https://github.com/ghislainv/forestatrisk/archive/master.zip"
conda_install("forestatrisk", git_repo, pip=TRUE, pip_ignore_installed=FALSE)
py_discover_config(required_module="forestatrisk",
use_environment="forestatrisk")
}Specify the Python version to use with reticulate and check that it is the right one.
# On RStudio server (uncomment the following two lines)
Sys.unsetenv("DISPLAY")
use_python("/usr/bin/python3")
# On RStudio desktop (uncomment the following lines)
#use_condaenv("forestatrisk")
# Check python configuration
py_config()## python: /usr/bin/python3
## libpython: /usr/lib/python3.7/config-3.7m-x86_64-linux-gnu/libpython3.7.so
## pythonhome: //usr://usr
## version: 3.7.3 (default, Apr 3 2019, 05:39:12) [GCC 8.3.0]
## numpy: /usr/lib/python3/dist-packages/numpy
## numpy_version: 1.16.2
##
## python versions found:
## /usr/bin/python3
## /usr/bin/python
## /home/ghislain/miniconda3/envs/forestatrisk/bin/python
## /home/ghislain/miniconda3/bin/python
## /home/ghislain/tensorflow/bin/python
Import the Python modules into R.
# Generating data
nobs <- 100
x.seq <- runif(n=nobs,min=0,max=100)
a.target <- 2
b.target <- 2
sigma2.target <- 300
y.seq <- a.target + b.target*x.seq +
rnorm(n=nobs,mean=0,sd=sqrt(sigma2.target))
# Data-set
Data <- as.data.frame(cbind(y.seq,x.seq))
# Plot
plot(x.seq,y.seq)##
## Call:
## lm(formula = y.seq ~ x.seq, data = Data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -53.933 -10.025 -1.533 10.973 36.803
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -5.65654 3.61458 -1.565 0.121
## x.seq 2.07924 0.06112 34.019 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 17.16 on 98 degrees of freedom
## Multiple R-squared: 0.9219, Adjusted R-squared: 0.9211
## F-statistic: 1157 on 1 and 98 DF, p-value: < 2.2e-16
## (Intercept) x.seq
## -5.656535 2.079240
## [1] 291.6321
## [1] 98
# Graphics
x.pred <- seq(from=0,to=100,length.out=100) # We will predict y for 100 values of x
X.pred <- as.data.frame(x.pred)
names(X.pred) <- "x.seq"
Y.pred <- predict.lm(M,X.pred,interval="confidence",
type="response")
plot(x.seq,y.seq,col=grey(0.7),
xlab="X",
ylab="Y",
main="Simple linear regression") # Row data
# Confidence envelops for predictive posterior
lines(x.pred,Y.pred[,1],col="red",lwd=3)
lines(x.pred,Y.pred[,2],col="blue",lwd=2,lty=2)
lines(x.pred,Y.pred[,3],col="blue",lwd=2,lty=2)You can execute Python code within a R notebooks. We need to indicate which version of Python to use in the notebook for Python chunk.
## <class 'numpy.ndarray'>
## [21 6 7 3 10 13]
To model the spatial probability of deforestation, we need a map of the past deforestation and maps of potential explicative environmental variables. Environmental variables can be derived from topography (altitude and slope), accessibility (distance to roads, towns, and forest edge), deforestation history (distance to previous deforestation) or landscape policy (protected area network) for example. In our example, we use the following variables :
tab1 <- read.table("tables/variables.txt",sep=",",header=TRUE)
names(tab1) <- c("Product","Source","Variable derived","Unit","Resolution (m)")
# Kable
knitr::kable(tab1, linesep="") %>%
kable_styling(bootstrap_options = c("striped"))| Product | Source | Variable derived | Unit | Resolution (m) |
|---|---|---|---|---|
| Deforestation maps (1990-2000-2010) | BioSceneMada (1) | forest/non-forest | – | 30 |
| distance to forest edge | m | 30 | ||
| distance to previous deforestation | m | 30 | ||
| Digital Elevation Model | SRTM v4.1 CSI-CGIAR (2) | altitude | m | 90 |
| slope | ° | 90 | ||
| Highways | OSM - Geofabrik (3) | distance to roads | m | 150 |
| Places | distance to towns | m | 150 | |
| Waterways | distance to river | m | 150 | |
| Protected areas | Rebioma (4) | presence of protected area | – | 30 |
In our example, fordefor.tif is a forest raster at 30m for the year 2010 considering the deforestation on the period 2000-2010 in Madagascar. We can plot this raster and zoom on a region with function .plot.forest() in the forestatrisk package to see the deforestation data. The remaining forest appears in green and the deforested areas appear in red.
# Make output directory
if (!dir.exists("output")) {
dir.create("output")
}
# Plot forest cover change 2000-2010
fig <- far$plot$fcc(input_fcc_raster="data/models/2000-2010/fordefor.tif",
output_file="output/fcc.png",
col=c(255,0,0,255), # rgba color for deforestation
figsize=c(5,5),
dpi=150,
zoom=c(340000,412000,7420000,7500000))
knitr::include_graphics("output/fcc.png")We use the function .sample() from the forestatrisk module to sample 10,000 points (pixel centers) in the deforested areas and in the remaining forest (20,000 points in total). The input_forest_raster argument defines the path to the forest raster including the deforested pixels (with value=0) and the remaining forest pixels (value=1) after a given period of time. The random seed can be set with argument Seed to reproduce the data of the random sampling.
The .sample() function also extracts information from environmental variables for each sampled point. Sampling is done by block to allow the computation on large study areas (e.g. country or continental scale) with a fine spatial resolution (e.g. 30m). The var_dir argument indicates the directory including all the environmental raster files (they must be GeoTIFF files with extension .tif) that we want to test.
The .sample() function identifies the spatial cell for each sample point (sample point are grouped within a spatial cell). Spatial cells and grouped observations are used to estimate the spatial autocorrelation of the deforestation process. The csize argument define the width (in km) of the square spatial cells. Each spatial cell will be attributed a parameter. To avoid estimating too many parameters, width of the square spatial cells must be sufficiently large. Both points sampling and extraction of environmental data are done by block to avoid memory problems for big datasets.
## Help on function sample in module forestatrisk.sample:
##
## sample(nsamp=10000, Seed=1234, csize=10, var_dir='data', input_forest_raster='forest.tif', output_file='output/sample.txt', blk_rows=0)
## Sample points and extract raster values.
##
## This function (i) randomly draw spatial points in deforested and
## forested areas and (ii) extract environmental variable values for
## each spatial point.
##
## :param nsamp: number of random spatial points.
## :param seed: seed for random number generator.
## :param csize: spatial cell size in km.
## :param var_dir: directory with raster data.
## :param input_forest_raster: name of the forest raster file (1=forest, 0=deforested) in the var_dir directory
## :param output_file: path to file to save sample points.
## :param blk_rows: if > 0, number of lines per block.
## :return: a pandas DataFrame, each row being one observation.
# Training and validation data-set
if (!file.exists("output/sample-2000-2010.txt")) {
samp <- far$sample(nsamp=10000L, Seed=1234L, csize=10L,
var_dir="data/models/2000-2010",
input_forest_raster="fordefor.tif",
output_file="output/sample-2000-2010.txt",
blk_rows=1L)
}
samp <- read.table("output/sample-2000-2010.txt", header=TRUE, sep=",")
set.seed(1234)
train <- sample(1:20000, size=10000, replace=FALSE)
data_train <- samp[train,] %>% dplyr::filter(complete.cases(.))
data_valid <- samp[-train,] %>% dplyr::filter(complete.cases(.))## altitude dist_defor dist_edge dist_river dist_road dist_town fordefor sapm
## 1 62 258 60 4460 5972 8580 0 0
## 2 253 703 703 2970 2754 4621 0 1
## 3 116 2145 268 14829 14086 14335 0 0
## 4 164 1834 30 3189 4053 4053 0 1
## 5 203 268 170 6128 12399 13064 0 0
## 6 112 7778 324 7270 8396 6941 0 0
## slope X Y cell
## 1 4 774525 7451835 10010
## 2 7 407595 7402695 10297
## 3 1 344865 7499415 9562
## 4 2 403395 7395705 10378
## 5 1 449445 7283475 11274
## 6 10 1043835 8463045 1775
Sampled observations can be plotted using function .plot.obs() from the forestatrisk module. Dark red dots indicate deforestation observations and dark green dots indicate forest observations.
# Plot sample points
fig <- far$plot$obs(sample=data_train,
name_forest_var="fordefor",
input_fcc_raster="data/models/2000-2010/fordefor.tif",
output_file="output/obs.png",
zoom=c(340000,412000,7420000,7500000),
figsize=c(5,5),#c(11.69,8.27),
s=5,dpi=300)
knitr::include_graphics("output/obs.png")Before modelling the deforestation, it is important to look at the relationship between environmental variables and deforestation. Using formulas from the patsy Python module, we can specify the relationships that we want to look at. In the example below, we plot the relationships between some continuous environmental variables and the probability of deforestation using function .plot.correlation() from the forestatrisk package. Note that -1 must be set at the end of the formula. The function .correlation() returns a serie of graphics that can be analyzed to choose the right relationship for each continuous variable (linear or polynomial for example).
# Descriptive statistics
# Model formulas
formula_1 <- paste0("fordefor ~ dist_road + dist_town + dist_defor +",
"dist_river + dist_edge + altitude + slope - 1")
# Standardized variables (mean=0, std=1)
formula_2 <- paste0("fordefor ~ scale(dist_road) + scale(dist_town) +",
"scale(dist_defor) + scale(dist_river) + scale(dist_edge) +",
"scale(altitude) + scale(slope) - 1")
formulas <- c(formula_1, formula_2)
# Loop on formulas
for (f in 1:length(formulas)) {
# Output file
of <- paste0("output/correlation_",f,".pdf")
# Data
dmat <- patsy$dmatrices(formulas[f], data=data_train, eval_env=-1L,
return_type="dataframe")
# Plots
fig <- far$plot$correlation(y=dmat[[1]],data=dmat[[2]],
plots_per_page=3L,figsize=c(7,8),
dpi=100L,output_file=of)
}In this example (see pdf files produced), we can see that a linear model should be sufficient to represent the relationship between the probability of deforestation and the standardized distance to the nearest road or town. On the contrary, it might be interesting to fit a polynomial model for the the standardized distance to previous deforestation (dist_defor variable) for which the relationship seems non-linear. Several models can be fitted and compared to see if a second-order or third-order polynomial relationship is justified.
We propose to use the Binomial icar model (Vieilledent et al. 2014) to estimate the deforestation probability of a pixel given a set of environmental variables. The Binomial icar model is a linear Binomial logistic regression model including an intrinsic Conditional Autoregressive (icar) process to account for the spatial autocorrelation of the observations. Parameter inference is done in a hierarchical Bayesian framework. The .model_binomial_icar() function from the forestatrisk module includes a Metropolis-within-Gibbs algorithm written in pure C code to reduce computation time.
Figure caption: Parameter inference is done in a hierarchical Bayesian framework. Bayesian statistics rely on the Bayes’ theorem named after Reverend Thomas Bayes. Each parameter has a prior and an approximated posterior probability distribution from which we can compute the mean, standard deviation, credible intervals at 95%, etc.
For the deforestation process it is very important to take into account the spatial autocorrelation of the process with spatial random effects. Indeed, the selected fixed environmental variables are not able to fully explain the spatial variability of the deforestation process, especially when working at large geographical scales, such as the national or continental scale. Spatial random effects allow estimating a higher/lower probability of deforestation in a particular region (associated to unmeasurable or unknow factors) that is different from the mean probability of deforestation derived from the environmental factors included in the model. The Binomial icar model can be described as follow:
Ecological process
\[\begin{equation} y_i \sim \mathcal{B}inomial(\theta_i,t_i) \\ \text{logit}(\theta_i) = X_i \beta + \rho_{j(i)} \end{equation}\]
\(y_i\): random variable for the deforestation process (0 if no deforestation, 1 if deforestation)
\(\theta_i\): probability of deforestation
\(t_i\): number of trials (always 1 in our example)
\(X_i\): vector of values for environmental explicative variables
\(\beta\): vector of fixed effect parameters
\(\rho_j\): spatial random effect
\(j(i)\): index of the spatial entity for observation \(i\).
Spatial autocorrelation
The spatial autocorrelation is modelled with an intrinsic conditional autoregressive (icar) process:
\[\begin{equation} \rho_j \sim \mathcal{N}ormal(\mu_j,V_{\rho} / n_j) \end{equation}\]
\(\mu_j\): mean of \(\rho_{j'}\) in the neighborhood of \(j\).
\(V_{\rho}\): variance of the spatial random effects.
\(n_j\): number of neighbors for spatial entity \(j\).
Figure caption: Representation of the neighborhood for the intrinsic conditional autoregressive (icar) process. Target spatial cell \(j\) has 8 neighbors in this case. Several observations (black points, equivalent to pixel centers in our case) can be located in each spatial cell. Deforestation probability in one spatial cell \(j\) depends on deforestation probability in neighboring cells.
Before running the model, we add a column indicating the number of trials for each observation (1 in our case as we are considering a Bernoulli process). We then remove any observation with non-available data (NA) from the data-set. We also compute the number of neighbors (nneigh) and the neighbor identifiers (adj) for each spatial cell using function .cellneigh from the forestatrisk module.
# Spatial cells for spatial-autocorrelation
neighborhood <- far$cellneigh_ctry(raster="data/models/2000-2010/fordefor.tif",
vector="data/mada/mada38s.shp",
csize=10L, rank=1L)
nneigh <- neighborhood[[1]]
adj <- neighborhood[[2]]
cell_in <- neighborhood[[3]]
ncell <- neighborhood[[4]]
# Udpate cell number in training dataset
cell_rank <- vector()
for (i in 1:nrow(data_train)) {
cell_rank[i] <- which(cell_in==data_train$cell[i])-1 # ! cells start at zero
}
data_train$cell <- cell_rank
# Udpate cell number in validation dataset
cell_rank <- vector()
for (i in 1:nrow(data_valid)) {
cell_rank[i] <- which(cell_in==data_valid$cell[i])-1 # ! cells start at zero
}
data_valid$cell <- cell_rankA model formula must also be defined to specify the explicative variables we want to include in the model. The formula allows specifying some variable transformations (such as standardization in our case). See the patsy module for more information. In our model, we included the following variables: location inside a protected area, altitude, distance to past deforestation (with a degree two polynomial), distance to forest edge, distance to nearest road and distance to nearest town. The formula must end with the name of the variable indicating the spatial cell for each observation (cell in our case).
# Model
mod_icar <- far$model_binomial_iCAR(
# Observations
suitability_formula=formula, data=data_train,
# Spatial structure
n_neighbors=np_array(nneigh,dtype="int32"), neighbors=np_array(adj,dtype="int32"),
# Environment
eval_env=-1L,
# Chains
#burnin=2000L, mcmc=5000L, thin=5L,
burnin=1000L, mcmc=1000L, thin=1L,
# Starting values
beta_start=-99)Once the model has been fitted, we can print a summary of the model showing the parameter estimates. The 95% credible intervals obtained from the posterior distribution of each parameter, except distance to nearest town (dist_town), do not include zero, indicating that parameters are significantly different from zero. The variance of the spatial random effects (Vrho) is given together with the deviance value, which can be used to compare different statistical models (lower deviance is better). Looking at the parameter estimates, we can see that the deforestation probability is much lower inside protected areas and that deforestation probability decreases with altitude, slope, distance to past deforestation, forest edge, roads and towns. Parameter values are then coherent regarding the deforestation process and easy to interpret.
## Binomial logistic regression with iCAR process
## Model: I(1 - fordefor) + trials ~ 1 + C(sapm) + scale(altitude) + scale(slope) + scale(dist_defor) + scale(dist_edge) + scale(dist_road) + scale(dist_town) + cell
## Posteriors:
## Mean Std CI_low CI_high
## Intercept -0.294 0.0694 -0.42 -0.143
## C(sapm)[T.1.0] -0.435 0.0926 -0.607 -0.255
## scale(altitude) -0.553 0.0771 -0.699 -0.409
## scale(slope) -0.136 0.0403 -0.209 -0.0531
## scale(dist_defor) -0.424 0.0448 -0.516 -0.343
## scale(dist_edge) -0.677 0.0549 -0.799 -0.582
## scale(dist_road) -0.0335 0.0508 -0.127 0.0988
## scale(dist_town) -0.145 0.0529 -0.248 -0.0501
## Vrho 7.22 0.51 6.31 8.23
## Deviance 9.84e+03 61.5 9.73e+03 9.96e+03
## Binomial logistic regression with iCAR process
## Model: I(1 - fordefor) + trials ~ 1 + C(sapm) + scale(altitude) + scale(slope) + scale(dist_defor) + scale(dist_edge) + scale(dist_road) + scale(dist_town) + cell
## Posteriors:
## Mean Std CI_low CI_high
## Intercept -0.294 0.0694 -0.42 -0.143
## C(sapm)[T.1.0] -0.435 0.0926 -0.607 -0.255
## scale(altitude) -0.553 0.0771 -0.699 -0.409
## scale(slope) -0.136 0.0403 -0.209 -0.0531
## scale(dist_defor) -0.424 0.0448 -0.516 -0.343
## scale(dist_edge) -0.677 0.0549 -0.799 -0.582
## scale(dist_road) -0.0335 0.0508 -0.127 0.0988
## scale(dist_town) -0.145 0.0529 -0.248 -0.0501
## Vrho 7.22 0.51 6.31 8.23
## Deviance 9.84e+03 61.5 9.73e+03 9.96e+03
To check for the convergence of the Markov chain Monte Carlo (MCMC), we can plot the traces and the posterior distributions of the estimated parameters using method .plot() associated to the hSDM_binomial_icar class defined in the deforestprob module. This method returns the figures showing the traces and posterior distributions.
# Get the spatial random effects
rho <- rep(-9999,ncell) # -9999 will be considered as nodata
rho[cell_in+1] <- mod_icar$rho
# Write to GeoTIFF
far$wrast_rho(rho, input_raster="data/models/2000-2010/fordefor.tif",
csize=10,
output_file="output/rho_orig.tif")
# Plot random effects
fig <- far$plot$rho("output/rho_orig.tif",output_file="output/rho_orig.png")
# Plot with R
mada <- rgdal::readOGR(dsn="data/mada",layer="mada38s", verbose=FALSE)
r_rho_orig <- raster("output/rho_orig.tif")
rho_plot(r_rho_orig, mada, output_file="output/rho_orig_ggplot.png",
quantiles_legend=c(0.025,0.975),width=4.5, height=8)## Results plotted to "output/rho_orig_ggplot.png"
We use the model to predict the spatial probability of deforestation at the national scale for Madagascar. Before, doing so, we smooth the spatial random effects which have been estimated at a coarse resolution (10km in our example). To do so, we use the function .resample_rho() from the deforestprob module to resample the results at a finer resolution using a bilinear interpolation. The function writes a raster file to the disk with a resolution of the raster specified in the argument input_raster of the function (1km in our case). The function .resample_rho() returns a figure of the spatial random effects that can be plotted.
# Resample them
fig <- far$interpolate_rho(rho=r_to_py(rho), input_raster="data/models/2000-2010/fordefor.tif",
output_file="output/rho.tif",
csize_orig=10L, csize_new=1L)
# Plot random effects
fig <- far$plot$rho("output/rho.tif",output_file="output/rho.png")
# Plot with R
r_rho <- raster("output/rho.tif")
rho_plot(r_rho, mada, output_file="output/rho_ggplot.png",
quantiles_legend=c(0.025,0.975),width=4.5, height=8)## Results plotted to "output/rho_ggplot.png"
The .predict_raster_binomial_icar() function of the forestatrisk module can be used to predict the spatial probability of deforestation from an hSDM_binomial_icar model (i.e. an object of class hSDM_binomial_icar). The function writes a raster of predictions to the disk. The prediction is done by block to avoid memory problems for big datasets. Functions will return NA for pixels with no forest or for pixels with missing environmental variables.
if (!file.exists("output/prob_icar.tif")) {
far$predict_raster_binomial_iCAR(
mod_icar, var_dir="data/models/2000-2010",
input_cell_raster="output/rho.tif",
input_forest_raster="data/models/2000-2010/fordefor.tif",
output_file="output/prob_icar.tif",
blk_rows=128L)
}The raster of predictions can then be plotted.
if (!file.exists("output/prob_icar.png")) {
fig <- far$plot$prob("output/prob_icar.tif",
output_file="output/prob_icar.png",
figsize=c(4,4))
}
knitr::include_graphics("output/prob_icar.png")Given the spatial probability of deforestation and the number of hectares that should be deforested, we can predict the future forest cover using function .deforest() from the forestatrisk package. The number of hectares are converted into number of pixels to be deforested. Pixels with the highest probability of deforestation are deforested first. The function computes a probability threshold above which pixels are deforested.
In our example, we consider an annual deforestation of roughly 100,000 ha for Madagascar. Considering the period 2010-2050, this would correspond to 4 Mha of deforestation. This number can of course be more precise and refined considering various deforestation scenarios (demographic growth, economic development, etc.).
if (!file.exists("output/proj2050_icar.tif")) {
deforest <- far$deforest(input_raster="output/prob_icar.tif",
hectares=4000000,
output_file="output/proj2050_icar.tif",
blk_rows=128L)
}We can plot the predicted future forest cover in 2050 with leaflet. We first reproject the raster to WGS 84 / World Mercator projection (EPSG:3857) and we resample the map at 250 m using function gdalwarp from GDAL. The 250 m resolution raster that can be transformed into a lower size image that can be plotted with leaflet.
if [ ! -f output/proj2050_icar_epsg3857_ov32.tif ]
then
gdalwarp -overwrite -s_srs EPSG:32738 -t_srs EPSG:3857 -tr 250 250 \
-ot Byte -r near -co "COMPRESS=LZW" -co "PREDICTOR=2" -co "BIGTIFF=YES" \
output/proj2050_icar.tif output/proj2050_icar_epsg3857_ov32.tif
fiWe also set the color of the map and the extent of the view for leaflet.
# Colors and extent view for leaflet
r <- raster("output/proj2050_icar_epsg3857_ov32.tif")
red_rgb <- c(225,26,28)
forestgreen_rgb <- c(51,160,44)
rgbColors <- matrix(c(red_rgb,forestgreen_rgb), ncol=3, byrow=TRUE)
Colors <- rgb(rgbColors/255)
pal <- colorFactor(Colors, c(0,1), na.color = "transparent")
ext_PK32 <- c(340000,412000,7420000,7500000) # PK32-Ranobe
r1 <- raster(nrows=1,ncols=1,ext=extent(ext_PK32),
crs=CRS("+init=epsg:32738"))
r2 <- projectExtent(r1, crs=CRS("+init=epsg:4326"))
ext2 <- extent(r2)The red areas represent the deforestation on the period 2010-2050. The green areas represent the remaining forest in 2050. Most of the remaining forest in 2050 are inside the protected areas or located in remote areas, at high altitudes and far from roads and big cities (for example in the Tsaratanana mountain region and around the Masoala peninsula, north-east Madagascar).
# Null model
formula_null <- "I(1-fordefor) ~ 1"
dmat_null <- patsy$dmatrices(formula_null, data=r_to_py(data_train), NA_action="drop",
return_type="dataframe", eval_env=-1L)
Y <- dmat_null[[1]]
X_null <- dmat_null[[2]]
mod_null <- sm$GLM(Y, X_null, family=sm$families$Binomial())$fit()
print(mod_null$summary())## Generalized Linear Model Regression Results
## ===============================================================================
## Dep. Variable: I(1 - fordefor) No. Observations: 10000
## Model: GLM Df Residuals: 9999
## Model Family: Binomial Df Model: 0
## Link Function: logit Scale: 1.0000
## Method: IRLS Log-Likelihood: -6931.0
## Date: dim., 12 janv. 2020 Deviance: 13862.
## Time: 15:30:45 Pearson chi2: 1.00e+04
## No. Iterations: 3
## Covariance Type: nonrobust
## ==============================================================================
## coef std err z P>|z| [0.025 0.975]
## ------------------------------------------------------------------------------
## Intercept -0.0200 0.020 -1.000 0.317 -0.059 0.019
## ==============================================================================
# Simple glm with no spatial random effects
formula_glm <- paste0("I(1-fordefor) ~ C(sapm) + scale(altitude) + ",
"scale(slope) + scale(dist_defor) + ",
"scale(dist_edge) + scale(dist_road) + scale(dist_town)")
mod_glm <- smf$glm(formula_glm, r_to_py(data_train),
family=sm$families$Binomial(), eval_env=-1L)$fit()
# Summary glm
sink(file="output/summary_mod_binomial_glm.txt")
print(mod_glm$summary())## Generalized Linear Model Regression Results
## ===============================================================================
## Dep. Variable: I(1 - fordefor) No. Observations: 10000
## Model: GLM Df Residuals: 9992
## Model Family: Binomial Df Model: 7
## Link Function: logit Scale: 1.0000
## Method: IRLS Log-Likelihood: -6420.3
## Date: dim., 12 janv. 2020 Deviance: 12841.
## Time: 15:30:46 Pearson chi2: 1.13e+04
## No. Iterations: 5
## Covariance Type: nonrobust
## =====================================================================================
## coef std err z P>|z| [0.025 0.975]
## -------------------------------------------------------------------------------------
## Intercept 0.0331 0.027 1.214 0.225 -0.020 0.087
## C(sapm)[T.1.0] -0.3065 0.048 -6.420 0.000 -0.400 -0.213
## scale(altitude) -0.1816 0.026 -6.975 0.000 -0.233 -0.131
## scale(slope) -0.1595 0.025 -6.262 0.000 -0.209 -0.110
## scale(dist_defor) -0.3617 0.030 -11.979 0.000 -0.421 -0.302
## scale(dist_edge) -0.4841 0.035 -13.990 0.000 -0.552 -0.416
## scale(dist_road) 0.0830 0.023 3.600 0.000 0.038 0.128
## scale(dist_town) -0.1433 0.023 -6.350 0.000 -0.187 -0.099
## =====================================================================================
## Generalized Linear Model Regression Results
## ===============================================================================
## Dep. Variable: I(1 - fordefor) No. Observations: 10000
## Model: GLM Df Residuals: 9992
## Model Family: Binomial Df Model: 7
## Link Function: logit Scale: 1.0000
## Method: IRLS Log-Likelihood: -6420.3
## Date: dim., 12 janv. 2020 Deviance: 12841.
## Time: 15:30:46 Pearson chi2: 1.13e+04
## No. Iterations: 5
## Covariance Type: nonrobust
## =====================================================================================
## coef std err z P>|z| [0.025 0.975]
## -------------------------------------------------------------------------------------
## Intercept 0.0331 0.027 1.214 0.225 -0.020 0.087
## C(sapm)[T.1.0] -0.3065 0.048 -6.420 0.000 -0.400 -0.213
## scale(altitude) -0.1816 0.026 -6.975 0.000 -0.233 -0.131
## scale(slope) -0.1595 0.025 -6.262 0.000 -0.209 -0.110
## scale(dist_defor) -0.3617 0.030 -11.979 0.000 -0.421 -0.302
## scale(dist_edge) -0.4841 0.035 -13.990 0.000 -0.552 -0.416
## scale(dist_road) 0.0830 0.023 3.600 0.000 0.038 0.128
## scale(dist_town) -0.1433 0.023 -6.350 0.000 -0.187 -0.099
## =====================================================================================
# Deviances
deviance_null <- mod_null$deviance
deviance_glm <- mod_glm$deviance
deviance_icar <- mod_icar$deviance
deviance_full <- 0
# Table
dev <- c(deviance_null, deviance_glm, deviance_icar, deviance_full)
mod_dev <- data.frame(model=c("null", "glm", "icar", "full"),
deviance=round(dev))
perc <- 100*(1-mod_dev$deviance/deviance_null)
mod_dev$perc <- round(perc)
write.table(mod_dev,file="output/deviance_model_comparison.txt",sep=",",row.names=FALSE)
mod_dev## model deviance perc
## 1 null 13862 0
## 2 glm 12841 7
## 3 icar 9844 29
## 4 full 0 100
Using the validation data-set, we can compute several model performance indices (Liu, White, and Newell 2011):
It is known that the value of some of these indices might vary with the intensity of deforestation (Pontius et al. 2008).
In our case, we compute these indices for various values of the deforestation rate (in percentage of deforestation observations) in our validation data-set (1, 5, 10, 25, 50). For each percentage, the threshold used to convert the predicted probabilities of deforestation were choosen in order to respect the deforestation rate (in %) in the observations.
set.seed(1234)
performance_null <- performance_index(data_valid,runif(nrow(data_valid)))
performance_nochange <- performance_index(data_valid,rep(0,nrow(data_valid)))
theta_pred <- rep(0,nrow(data_valid))
performance_glm <- performance_index(data_valid,mod_glm$predict(data_valid))
performance_icar <-performance_index(data_valid,mod_icar$predict(data_valid))
# Save
write.table(performance_null, file="output/performance_null.txt",
sep="\t", row.names=FALSE)
write.table(performance_nochange, file="output/performance_nochange.txt",
sep="\t", row.names=FALSE)
write.table(performance_glm, file="output/performance_glm.txt",
sep="\t", row.names=FALSE)
write.table(performance_icar, file="output/performance_icar.txt",
sep="\t", row.names=FALSE)
# Save for perc=10%
df_mod_valid <- rbind(performance_null[5,],performance_nochange[5,],
performance_glm[5,],
performance_icar[5,])
df_mod_valid <- df_mod_valid %>%
dplyr::mutate(mod=c("null","nochange","glm","icar")) %>%
dplyr::select(10,2:9)
write.table(df_mod_valid, file="output/df_mod_valid.txt", sep="\t", row.names=FALSE)
# Print
df_mod_valid## mod FOM OA EA Spe Sen TSS K AUC
## 1 null 0.33 0.50 0.5 0.50 0.50 0.00 0.00 0.50
## 2 nochange 0.00 0.50 0.5 1.00 0.00 0.00 0.00 0.50
## 3 glm 0.45 0.62 0.5 0.62 0.62 0.23 0.23 0.66
## 4 icar 0.58 0.73 0.5 0.73 0.73 0.47 0.47 0.81
We use the glm model to predict the spatial probability of deforestation.
if (!file.exists("output/prob_glm.tif")) {
fig_prob <- far$predict_raster(
mod_glm, var_dir="data/models/2010-2017",
input_forest_raster="data/forest/for2010.tif",
output_file="output/prob_glm.tif",
blk_rows=128L, transform=TRUE)
}We use the glm model to predict the forest cover in 2050.
We compute the differences between glm and icar models for forest cover predictions in 2050.
if (!file.exists("output/diff2050_icar_glm.tif")) {
far$r_diffproj(inputA="output/proj2050_icar.tif",
inputB="output/proj2050_glm.tif",
output_file="output/diff2050_icar_glm.tif",
blk_rows=128L)
}We plot the difference with function far.plot.differences().
if (!file.exists("output/diff2050_icar_glm.png")) {
far$plot$differences("output/diff2050_icar_glm.tif",
borders="data/mada/mada38s.shp",
output_file="output/diff2050_icar_glm.png")
}We can also plot with ggplot2
r_diff <- resamp2df(input_file="output/diff2050_icar_glm.tif",
output_file="output/diff2050_icar_glm_1km.tif",
res=1000)
mada <- rgdal::readOGR(dsn="data/mada",layer="mada38s")## OGR data source with driver: ESRI Shapefile
## Source: "/home/ghislain/Code/forestatrisk-tutorial/data/mada", layer: "mada38s"
## with 83 features
## It has 4 fields
rect_df <- data.frame(xmin=c(346000,793000), xmax=c(439000,886000),
ymin=c(7387000,7815000), ymax=c(7480000,7908000),
id=c(1,2))
p <- diff_plot(input_df=r_diff,
input_vector=mada,
output_file="output/diff2050_icar_glm_ggplot.png",
rect=rect_df)## Results plotted to "output/diff2050_icar_glm_ggplot.png"
Liu, Canran, Matt White, and Graeme Newell. 2011. “Measuring and Comparing the Accuracy of Species Distribution Models with Presence-Absence Data.” Ecography 34 (2): 232–43. https://doi.org/10.1111/j.1600-0587.2010.06354.x.
Pontius, Robert, Wideke Boersma, Jean-Christophe Castella, Keith Clarke, Ton de Nijs, Charles Dietzel, Zengqiang Duan, et al. 2008. “Comparing the Input, Output, and Validation Maps for Several Models of Land Change.” The Annals of Regional Science 42 (1). Springer Berlin / Heidelberg: 11–37. https://doi.org/10.1007/s00168-007-0138-2.
Vieilledent, Ghislain, Clovis Grinand, Fety A. Rakotomalala, Rija Ranaivosoa, Jean-Roger Rakotoarijaona, Thomas F. Allnutt, and Frédéric Achard. 2018. “Combining Global Tree Cover Loss Data with Historical National Forest Cover Maps to Look at Six Decades of Deforestation and Forest Fragmentation in Madagascar.” Biological Conservation 222: 189–97. https://doi.org/10.1016/j.biocon.2018.04.008.
Vieilledent, Ghislain, Cory Merow, Jérôme Guélat, Andrew M. Latimer, Marc Kéry, Alan E. Gelfand, Adam M. Wilson, Frédéric Mortier, and John A. Silander Jr. 2014. hSDM: hierarchical Bayesian species distribution models. https://doi.org/10.5281/zenodo.594920.